
Running replicates statistics on bins and bed files #80

Merged · 6 commits · Oct 7, 2020
Conversation

@cnluzon (Collaborator) commented Oct 7, 2020

Included a bwstats.R file with functionality to compute differential statistics based on the DESeq2 package, for both bins and BED files.

Added functions: bw_bed_diff_analysis and bw_bins_diff_analysis. These operate on two lists of bigWig files plus labels. Labels are mandatory at this point because I believe this makes the analysis more robust to mistakes like forgetting what was compared against what, so you need to provide at least "treated" / "untreated", or some other labels that identify the two groups of bigWig files.

This is a work in progress: I still need to add automated testing. Even though I am not including tests of the actual values at this point (relying on DESeq2 for those), I am going to include at least tests that check that it runs and that the parameters passed end up where they are supposed to.

Since we mostly operate on scaled bigWig files, the estimateSizeFactors step usually performed by DESeq2 is skipped by default, but an estimate_size_factors parameter is provided:

bw_bins_diff_analysis(bwlist_1, bwlist_2, "treated", "untreated", estimate_size_factors = TRUE)

will run the standard DESeq2 size-factor estimation. I provide this for cases where we are not looking at scaled bigWig files; the default value of this parameter is FALSE.

The return value of these functions is a results table like the one obtained from DESeq2::results, which means that for any analysis you still need to set cutoff thresholds for p-value and/or fold change.
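For the downstream thresholding step, a minimal sketch could look like the following. It assumes the returned table carries the usual DESeq2::results columns (padj, log2FoldChange), and bwlist_1 / bwlist_2 are hypothetical vectors of bigWig file paths:

```r
# Sketch only: bwlist_1 / bwlist_2 stand in for the two groups of bigWig files.
res <- bw_bins_diff_analysis(bwlist_1, bwlist_2, "treated", "untreated")

# The result behaves like a DESeq2::results table, so cutoffs are applied
# downstream, e.g. adjusted p-value < 0.05 and |log2 fold change| > 1:
significant <- subset(as.data.frame(res),
                      padj < 0.05 & abs(log2FoldChange) > 1)
```

The cutoff values here are illustrative; the point is that the function itself does not filter, leaving the thresholds to the analysis.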

@shaorray (Contributor) commented Oct 7, 2020

One detail about DESeq's testing: the input counts need to be integers larger than 1, and preferably above 100.

Since it transforms read counts (or bin counts) with a hyperbolic function, smaller numbers (1, 10) are more sensitive for DESeq than larger ones like 1000.

Because Rsamtools::summary gives read density in the bins, the counts are normally below 1, so multiplying by the bin width, or by any large constant, is a good idea to avoid artefacts.
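The density-to-count scaling suggested above can be sketched in base R with toy numbers (the densities and bin width here are made up for illustration):

```r
# Per-bin coverage densities, as a bigWig summary typically reports them:
# fractional values well below the integer counts DESeq2 expects.
density <- c(0.18, 0.03, 0.42)
bin_width <- 10000  # assumed bin size in bp

# Multiply by the bin width (or any large constant) and round to integers,
# so DESeq2 sees counts well above 1 instead of fractional densities.
counts <- round(density * bin_width)
counts
# [1] 1800  300 4200
```

Any constant large enough to push the values past ~100 achieves the same effect; the bin width is simply a natural choice when the bins are fixed-size.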

@cnluzon (Collaborator, Author) commented Oct 7, 2020

My idea was to simulate the "real" number of reads using the estimated fragment length and bin size, but then I thought a constant value independent of length may be better if the input is a BED file whose features are not bins and whose lengths may vary. Of course, one could argue that large differences in length would affect results in either case.

I will use a larger constant for the count matrix to make the results more robust. Thanks!

@cnluzon cnluzon merged commit 1f4993c into master Oct 7, 2020
@cnluzon cnluzon deleted the deseq-func branch October 7, 2020 16:41